[SPARK-45190][SPARK-48897][PYTHON][CONNECT] Make from_xml support StructType schema #47355

Closed
wants to merge 2 commits into from

@zhengruifeng zhengruifeng commented Jul 15, 2024

What changes were proposed in this pull request?

Make from_xml support StructType schema

Why are the changes needed?

Passing a StructType schema to from_xml was supported in Spark Classic, but not in Spark Connect.

to address #43680 (comment)
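
For readers new to `from_xml`, the following is a minimal standard-library sketch of what parsing an XML string against a one-field struct schema means. It is not Spark's implementation: `parse_xml_struct` and the dict-based `SCHEMA` are hypothetical stand-ins for `from_xml` and `StructType`, and returning `None` for a missing element is a choice made for this sketch only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a StructType: field name -> Python converter.
SCHEMA = {"a": int}


def parse_xml_struct(xml_text, schema):
    """Map each schema field to the text of the same-named child element."""
    root = ET.fromstring(xml_text)
    row = {}
    for name, convert in schema.items():
        child = root.find(name)
        # A missing or empty element becomes None (a choice made for this sketch).
        if child is not None and child.text:
            row[name] = convert(child.text)
        else:
            row[name] = None
    return row


print(parse_xml_struct("<p><a>1</a></p>", SCHEMA))  # {'a': 1}
```

The single field `a` mirrors the schema used in the description: one struct per input row, with one value per schema field.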

Does this PR introduce any user-facing change?

before:

from pyspark.sql.types import StructType, LongType
import pyspark.sql.functions as sf
data = [(1, '''<p><a>1</a></p>''')]
df = spark.createDataFrame(data, ("key", "value"))

schema = StructType().add("a", LongType())
df.select(sf.from_xml(df.value, schema)).show()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[1], line 7
...
AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601

JVM stacktrace:
org.apache.spark.sql.AnalysisException
	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:278)
	at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98)
	at org.apache.spark.sql.catalyst.parser.AbstractParser.parseDataType(parsers.scala:40)
	at org.apache.spark.sql.types.DataType$.$anonfun$fromDDL$1(DataType.scala:126)
	at org.apache.spark.sql.types.DataType$.parseTypeWithFallback(DataType.scala:145)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:127)

after:

+---------------+
|from_xml(value)|
+---------------+
|            {1}|
+---------------+
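
The trace above fails inside `DataType.fromDDL` at a `{` character. One plausible reading, not stated in the PR text, is that the schema reached the server in its JSON form and was handed to a DDL parser. The sketch below illustrates that mismatch with a toy parser. `toy_parse_ddl` and `parse_schema` are hypothetical helpers, the JSON string is hand-written to resemble what a `StructType` serializes to, and nothing here is meant to describe how the PR actually resolves the issue.

```python
import json

# Assumed JSON form of StructType().add("a", LongType()), written by hand
# so the example runs without pyspark installed.
schema_json = json.dumps({
    "type": "struct",
    "fields": [{"name": "a", "type": "long", "nullable": True, "metadata": {}}],
})
# DDL form describing the same single-field struct.
schema_ddl = "a BIGINT"


def toy_parse_ddl(text):
    """Toy DDL parser: accepts only 'name TYPE[, name TYPE ...]'."""
    fields = []
    for part in text.split(","):
        tokens = part.split()
        if len(tokens) != 2 or not tokens[0].isidentifier():
            raise ValueError(f"Syntax error at or near {part.strip()[:1]!r}")
        fields.append((tokens[0], tokens[1].upper()))
    return fields


def parse_schema(text):
    """Try the DDL form first, then fall back to the JSON form."""
    try:
        return toy_parse_ddl(text)
    except ValueError:
        spec = json.loads(text)
        return [(f["name"], f["type"].upper()) for f in spec["fields"]]


try:
    toy_parse_ddl(schema_json)
except ValueError as exc:
    print(exc)  # the toy parser rejects the JSON form at its first character

print(parse_schema(schema_ddl))
print(parse_schema(schema_json))
```

The point of the sketch is only that the two string forms of a schema are not interchangeable: a parser that expects DDL stops at the opening brace of the JSON form unless something converts or falls back.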

How was this patch tested?

Added a doctest and enabled the unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng
Contributor Author

also cc @sandip-db

nit
@@ -16303,7 +16303,21 @@ def from_xml(
 >>> df.select(sf.from_xml(df.value, schema).alias("xml")).collect()
 [Row(xml=Row(a=1))]

-Example 2: Parsing XML with :class:`ArrayType` in schema
+Example 2: Parsing XML with a :class:`StructType` schema

Contributor

Thanks for fixing this. Can you please reuse the existing jira #SPARK-45190?

Contributor Author

Sure, I was not aware of that ticket. I will also add it to the title.

Contributor

Also, can you please enable the tests here?

Contributor Author

sure

@zhengruifeng zhengruifeng changed the title from "[SPARK-48897][PYTHON][CONNECT] Make from_xml support StructType schema" to "[SPARK-45190][SPARK-48897][PYTHON][CONNECT] Make from_xml support StructType schema" on Jul 16, 2024
@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the from_xml_struct branch July 16, 2024 05:54
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…tructType schema

### What changes were proposed in this pull request?
Make `from_xml` support StructType schema

### Why are the changes needed?
StructType schema was supported in Spark Classic, but not in Spark Connect

to address apache#43680 (comment)

### Does this PR introduce _any_ user-facing change?

before:
```
from pyspark.sql.types import StructType, LongType
import pyspark.sql.functions as sf
data = [(1, '''<p><a>1</a></p>''')]
df = spark.createDataFrame(data, ("key", "value"))

schema = StructType().add("a", LongType())
df.select(sf.from_xml(df.value, schema)).show()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[1], line 7
...
AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601

JVM stacktrace:
org.apache.spark.sql.AnalysisException
	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:278)
	at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98)
	at org.apache.spark.sql.catalyst.parser.AbstractParser.parseDataType(parsers.scala:40)
	at org.apache.spark.sql.types.DataType$.$anonfun$fromDDL$1(DataType.scala:126)
	at org.apache.spark.sql.types.DataType$.parseTypeWithFallback(DataType.scala:145)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:127)
```

after:
```
+---------------+
|from_xml(value)|
+---------------+
|            {1}|
+---------------+

```

### How was this patch tested?
added doctest and enabled unit tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47355 from zhengruifeng/from_xml_struct.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024